For a quick overview of what meta-analyses (MAs) are, please watch this 3-minute video:
Meta-analyses (MAs) can be useful for two main purposes: (1) theory building and evaluation and (2) practical decisions during study design. This section starts with some basics on why single studies might not be as reliable as an MA and then explains how exactly MAs overcome this problem.
When thinking about development, we often wonder which abilities infants display at what age. To this end, we regularly look at published studies that set up smart experiments to test whether infants have specific abilities, for example whether infants treat native vowels differently from non-native ones, and when that ability develops (for more details on this specific topic see http://inphondb.acristia.org).
When trying to solve the puzzle of language acquisition, it is crucial to have a clear picture of infants’ abilities, and when evaluating an existing theory, we want to know whether new studies confirm or contradict its predictions. However, the results of a single experiment do not allow us to draw direct conclusions about infants’ underlying abilities: each experiment measures the behavior of a set of infants in a very specific situation, which might not generalize to other situations. Moreover, there might be measurement error in this one-time snapshot of reality.
The biggest problem when consulting single published studies is the false positive. Two causes are especially common: practices that increase the chance of a significant p-value, and biases.
Every study we run has (at least) a 5% chance of telling us that infants can do something when this is not true, according to the significance threshold alpha (commonly set to .05). p-values that fall under this threshold are supposed to tell us that the results we observed are not very likely due to chance.
This likelihood increases when researchers fall victim to their own biases or engage in seemingly innocent, and possibly common, practices that increase the chance of a false positive. None of these necessarily come with bad intentions or a sense of wrongdoing, so it is worth discussing them briefly.
Let’s start with a very strong motivation for obtaining positive results, independent of whether they are true or false: journal publications. Articles are the key to being considered a smart and valuable scientist. Journals, especially the big names, want to publish new, exciting, and sometimes surprising findings! So at present all the incentives push researchers toward obtaining that wonderfully significant p-value. People are trying to change this, but common practice is still this way.
Increasing the chance of a false positive can be due to a number of practices, such as analyzing the same dataset in multiple ways (t-test, ANOVA, collapsing or splitting groups, and so on) and reporting only the single significant outcome without correcting for multiple comparisons; or rejecting participants based on the dependent variable (e.g., excluding all infants who did not look at the named picture in a task that tests infants’ ability to recognize object labels).
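To make this concrete, here is a minimal sketch in R (our own illustration; the choice of tests and the numbers are arbitrary): we simulate data with no true effect and compare the false positive rate of a single planned test against the rate obtained by running two analyses and reporting the better one.

```r
# A minimal sketch: data with NO true effect, analyzed once as planned
# versus "whichever test looks better".
set.seed(1)
n_sims <- 5000
single <- flexible <- numeric(n_sims)
for (i in seq_len(n_sims)) {
  a <- rnorm(20)  # group A, true effect = 0
  b <- rnorm(20)  # group B, true effect = 0
  p1 <- t.test(a, b)$p.value       # the planned analysis
  p2 <- wilcox.test(a, b)$p.value  # a second, "also reasonable" analysis
  single[i]   <- p1
  flexible[i] <- min(p1, p2)       # keep whichever p-value is smaller
}
mean(single < .05)    # close to the nominal 5%
mean(flexible < .05)  # above 5%: analysis flexibility inflates false positives
```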
Biases are nearly omnipresent in human cognition. When we expect something to be true, we are unconsciously prone either to make it happen (think self-fulfilling prophecies) or to perceive the world in a way that aligns with our expectations. Biases also affect researchers: random patterns can become important results (“it turns out girls are able to solve this task, but boys are not!”); or, after months and months of running the experiment and digging through the data, the original hypothesis and analysis plan are lost, and the significant finding (obtained through the practices mentioned in the previous paragraph) appears to be what we were looking for all along. Biases can even make us unconsciously influence participants or results when we know the condition (for example, being much more lenient in coding babies’ looking on target for the “correct” trials and much stricter for the “incorrect” trials; looking times will systematically differ in this case). This last bias is often avoided by blinding the researcher to the condition, but the other two are harder to avoid and require conscious efforts, such as pre-registering analysis plans.
Some further reading on biases:

* https://explorable.com/research-bias
* http://www.nature.com/news/how-scientists-fool-themselves-and-how-they-can-stop-1.18517
Collecting many study results from different researchers is one way to compensate for the possibility that biases influenced the outcome. We can even use MAs to check for biases, for example by asking whether a suspicious number of p-values falls just below the significance threshold, or whether results are systematically skewed in one direction. Why biases matter is wonderfully illustrated here: http://www.alltrials.net/news/the-economist-publication-bias/. Checking for biased results is a whole literature on its own, but as a start, tools such as p-curving apps are easily available to every researcher; http://www.p-curve.com/ and http://shinyapps.org/apps/p-checker/ are two well-documented examples.
In addition to the possibility of obtaining false positives, there is also the possibility of being unable to measure an effect despite it being there. This issue has two implications. First, failing to replicate a study might not mean that the previous finding is a false positive; it could instead point to noisy measures, small effects (what effects are and how we measure them is explained below in section 2), and consequently low power. This means that a typical infant study, which tests, often at great cost (both in personal investment and money), between 12 and 24 infants per condition, might not reliably pick up an effect that is truly present, if the phenomenon in question leads to only a small change in the dependent variable.
We often do not know about these non-significant findings because it is quite difficult to publish them, but they can happen to all of us. MAs help us in experiment design so we can avoid false negatives due to low power. When the size of an effect is known and the significance threshold is fixed, calculating power is straightforward. Here is a simulation of how all the ingredients fit together: <rpsychologist.com/d3/NHST/>.
To increase power and make up for small effects, we need to test more babies. But we do not want to needlessly spend time and money in the lab, so finding a balance is important. To this end, it is very useful to have a good idea of the effect size in question, based on an MA. Once we have the effect size, we can calculate the minimal number of babies needed to observe an existing effect with sufficiently high probability (usually 80%).
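For example, here is a minimal sketch in R using the built-in `power.t.test` function (the effect size d = 0.45 is invented for illustration; substitute the estimate from your MA):

```r
# Sample size per group for 80% power at alpha = .05, given Cohen's d.
# With sd = 1, the delta argument equals Cohen's d. d = 0.45 is made up;
# use the meta-analytic estimate instead.
power.t.test(delta = 0.45, sd = 1, sig.level = .05, power = .80,
             type = "two.sample")
# The n in the output is the number of infants needed per group (round up).
```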
MAs can also help with study design, especially when they come in the form of a CAMA (community-augmented MA), which codes many design variables, for example the words or sounds used, how long trials were, and so on. Instead of doing a tiresome literature review, you can find the most common procedure, or the one associated with the biggest effect, by looking at the data in a CAMA; CAMAs come with detailed instructions on how to download and inspect their contents as simple spreadsheets.
In MAs we express the outcome of a single experiment in a way that captures how big an effect is and how much it varies. There are three families of effect sizes: (1) effect sizes based on means, which include Cohen’s d, the one we focus on from here on; (2) effect sizes based on binary data; and (3) effect sizes based on correlations. Since most developmental studies in the lab compare mean responses of two groups, or of the same infant in two (or more) conditions, Cohen’s d is the appropriate effect size measure. In this section and the following ones, we provide a gentle introduction to effect sizes; a list of recommended readings is also provided at the end of this document.

Cohen’s d is based on standardized mean differences. To get a feel for Cohen’s d, we highly recommend playing with the RPsychologist visualization. In a typical infant study, babies might hear two types of trials, and the responses to each are compared. In most papers, it is sufficient that the difference between the trial types reaches statistical significance, but in a meta-analysis we care about the size of this single observed effect and its variance. This allows us to pool over several studies, weight each datapoint, and arrive at an estimate of the underlying, true effect. This in turn allows us to calculate power and to check whether effect sizes are systematically affected by variables such as infant age in “moderator analyses” (see http://metalab.stanford.edu).
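To make this concrete, here is a minimal sketch in R of computing Cohen’s d and its variance from two group means and SDs (all numbers are invented; the variance formula follows Borenstein et al., 2009, listed in the readings below):

```r
# Cohen's d for two independent groups; numbers invented for illustration.
m1 <- 10.2; sd1 <- 2.1; n1 <- 16   # e.g., looking time, trial type 1
m2 <-  8.9; sd2 <- 2.4; n2 <- 16   # e.g., looking time, trial type 2
sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
d <- (m1 - m2) / sd_pooled
# Approximate variance of d, used later for weighting each study:
var_d <- (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
c(d = d, var_d = var_d)
```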
Recommended further readings for an introduction to effect sizes:

1. Textbooks are great for getting a basic overview of how to calculate effect sizes. We consulted: Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
2. A great primer and a spreadsheet for calculating effect sizes by hand can be found via: Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4:863. Material: osf.io/ixgcd/files/
3. Since textbooks do not cover every possible question that different meta-analysts may encounter, we turned to articles for more specific questions. We found this article useful for considering the comparability of effect sizes from within- and between-participant designs: Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7(1), 105–125. doi: 10.1037/1082-989X.7.1.105
4. Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to Meta-Analysis. John Wiley & Sons. doi: 10.1002/9780470743386.ch3
Choose the appropriate level of detail for your MA topic. The topic of your meta-analysis should be broader than that of a single experiment (e.g., “How do babies segment words of different stress patterns?”), but narrower than a whole research field (e.g., “How do babies learn language?”). The goal is to be able to gather comparable papers, measuring consistent dependent variables, so that you can compute a common statistical metric (i.e., an effect size) from them.
Define your population of interest precisely. Homogeneous can mean many things: age, language, typical versus atypical development. You may run a meta-analysis where you accept many different levels of some variables and see how they affect results by defining them as MA moderators, for example checking whether effects are consistent across ages. There should still be some unifying element across your studies, though, so that your meta-analysis has one broad result.
Consider the number of available studies on your topic. Your MA topic also depends on how many studies have been done on it. If you want to run a simple comparative MA, as few as two studies could be okay. But if you want to run an analysis with a lot of moderators, 5 studies probably isn’t enough to warrant a meta-analysis.
It is important to build in traceability of your work from the start, particularly since in larger MAs other people may finish the work, or you may want to check later why you decided to exclude a given paper. So, to make sure that all of your decisions are recorded and clear, make a copy of this decision spreadsheet. Don’t forget to rename it, to give us a “viewing” link, and to clean it up as follows.
Step 1: Click on “File” and select the “Make a copy…” option
Step 2: In the window that appears, change the name to something like “MA_TOPIC”
Step 3: Click on the blue button “Share” on the top right.
Step 4: In the ensuing menu, click on “Get shareable link” on the top right
Step 5: Copy the link and send it to us.
Step 6: Clean up
The model spreadsheet contains some fake entries and notes. To avoid confusion, we recommend removing the instructions found on the top lines of each sheet and the fake information that is already entered, with one exception: the pink columns (A, B, and W) in the Relevant_studies_search sheet contain formulas that may be useful to you. So you might want to delete the contents of the other columns and keep those in order to reuse the formulas.
Additionally, make a copy of this flowchart, and rename and share it as you did for your spreadsheet above. This figure gives you an overview of the process; you will be filling in the boxes with the right numbers as you go along, so that people who continue this MA and/or those interested in assessing this work can make sure you followed the procedure.
In this step, you will go through the initial list you put together in step 2 and decide which papers to include or exclude, mostly based on the abstract. In addition to creating your sample for data entry (step 4), you will start honing your inclusion criteria. Typically, these will include:
a homogeneous scientific question: Make sure you have clearly defined the purview of your question. “Cross-situational learning”, for example, is vague to those outside the domain, so define it in a more specific way: “exposure to sets of images paired with wordforms, with the goal of studying word-form–image association, but crucially with multiple images shown at once (unlike, e.g., the switch procedure)”.
a homogeneous infant population: Typically-developing children between the ages of XX and YY (the precise ages may stem from your seminal paper; to start with, you could set the maximum to 36 months and the minimum to 0 months); also consider whether you need to restrict the sample based on infants’ native language, for theoretical reasons.
The third criterion, a homogeneous procedure, is perhaps the trickiest. Staying close to your seminal paper will reduce the amount of variation in your sample due to methodological “details”, and make it easier for you to enter data, because all the results will be structured in similar ways. But it is important to know that this is a potential source of bias. For instance, you could decide to only input data using a specific kind of artificial language because you know that papers not using this language have smaller effects. This would end up being a self-confirmation exercise, unless there are strong a priori theoretical reasons to exclude other kinds of language, that is, to assume that the learning algorithms attributed to the infant cannot be generalized to these other languages.
Every time you make a decision regarding these and other key criteria, remember to note it in your decision spreadsheet, in the last sheet called “Notes_inclusion”. For example, mine looks like this:
| Question | Decision | Date |
|---|---|---|
| a homogeneous scientific question | learning of speech sound categories, where the categories are represented by a multimodal versus unimodal distribution of acoustic correlates | 10/19/2015 |
| a homogeneous infant population | typically-developing children, between the ages of 0 and 36 months | 10/19/2015 |
| a homogeneous procedure | passive exposure in the lab, testing via any behavioral or non-behavioral method | 10/19/2015 |
The goal of this step is to put together a list of publications that you will look at (abstract only in step 3, and in full for a subset of those in step 4) and consider for inclusion. In a typical MA, you make the most comprehensive list possible in order to answer a specific research question and/or to cover a given phenomenon. This typically means going through about 1,000 abstracts and reading about 100 papers in full. You can start with the seminal paper for your effect of interest, and then look for the studies citing it. Use PubMed’s search to find your pivot study’s entry, for instance by copy-pasting the full paper title into the builder:
When you press “Search”, you will usually find the entry for your seminal paper (or, if the title was not unique, you might need to click through the entries found until you come across it). Notice on the right a section entitled “Cited by…”. Scroll down and click on the “See all…” link at the bottom of this section.
You will now see all studies citing your seminal one. Constrain it further by clicking on “Show additional filters” on the left, and checking the box for “Infant: birth to 23 months”:
You now want to save all these papers in your reference management software. If you use Zotero: click on the folder icon in your status/search bar. When you do so, a window will pop up with all the results for that PubMed page:
Click on “Select all” and “OK”. Repeat for the other search pages. This will store the citation information, including abstracts, in Zotero.
You can also interrogate PubMed with a script, such as the one we have prepared.
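As an illustration of what such a script might look like, here is a sketch using the rentrez package (our own example, not necessarily the script we prepared; the search term is a placeholder):

```r
# A minimal sketch of querying PubMed from R with the rentrez package;
# run install.packages("rentrez") first.
library(rentrez)
res <- entrez_search(db = "pubmed",
                     term = "infant word segmentation",  # your own query here
                     retmax = 100)
res$count  # how many hits PubMed reports
res$ids    # PubMed IDs; feed these to entrez_summary() for more details
```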
What if the title and abstract don’t allow me to decide?
Then play it safe and include the paper, to be checked based on the full text.
What if the title and abstract don’t allow me to decide, but in fact I know the paper and I know it needs to be excluded?
Then you probably have already seen the full text of the paper, so say “yes” for the screening decision, and then “no” for the full-text decision.
Ideally, you would enter everything: published or unpublished, proceedings or journal, etc. However, sometimes you may want to start a “seed” meta-analysis that just gives a rough idea of an area.
In this case, how large should your sample be? Mika and Molly have done some simulations to help you decide. By and large, the more, the better: estimates get more precise (confidence intervals narrow) as more papers are entered. Based on this information, we propose a minimum of 10 included experiments as a pragmatic first step, with the caveat that your estimate will not be very precise.
We are hoping that eventually all of these MAs may be included in MetaLab, so we ask you to use the MA template (create a copy, as you did in step 1), and follow the field specifications. Ideally, you would code all potentially relevant moderator variables (e.g., experimental manipulations) in addition to the core characteristics (columns in red; e.g. means). However, in the interest of time, you can get started with the core characteristics only. Remember once more to give us viewing rights (see step 1 for instructions).
One of my papers has a single experiment but involves both Spanish and English speakers who are tested on a native and a nonnative speech sound contrast. Should that count as 4 experiments (2 languages x 2 contrasts)?
How many rows you make depends on how the results are reported. In this case, the authors report the outcome separately for all four groups. Therefore, please enter the four groups separately; each into their own row. You can copy over descriptions of the experiment.
In Experiment 1, there are two age groups. Do I have to report the age for both groups or do I average both groups into one? If I have to report both groups, how do I report this in the input form?
How many rows you make depends on how the results are reported. In this case, the authors report an average outcome over both age groups, since they did not find a significant difference between the two groups. Therefore, please enter only one row and calculate the average age. If the results had been reported separately per age group, you would make a separate row for each group.
In a typical full MA, you go through the whole list and only then start entering. The procedure is as follows: go back to your spreadsheet and, for each study that was marked “yes” during screening, try to retrieve the full text as you normally would (e.g., through scholar.google.com, regular Google, your institution’s library, etc.). If you cannot retrieve it, mark this paper as “no” in column F, “Fulltext_retrieved”, of the Relevant_studies_search sheet. If you want, you can contact the authors to try to get the full text from them, in which case you can note this in column G.
If you do find the full text, go through the paper to find the first experiment reported. You will enter all experiments and conditions one at a time, and fill in their information in the MA spreadsheet you created in step 4.
IMPORTANT: You should work backwards from the results section: look at what dependent measures are reported fully enough that you will be able to extract an effect size from them.
The following information allows one to calculate an effect size (we are sticking to experimental designs, since most of our MAs are experimental):
between-participant studies: Means and SDs (not SEs!) of the dependent variable for each infant group are all that is required for the calculation of Cohen’s d. Sometimes, means and SDs are not available as numbers; if there are clear figures, you can try to estimate means and SDs using an online plot-digitizing app. If you decide to estimate values from figures, add a column to keep track of this. Finally, t or F values for the main effect, in combination with sample sizes, can be used to calculate Cohen’s d (see the sketch after this list); note them when available.
within-participant studies: Effect sizes for this type of study are calculated the same way as in between-participant studies, but in order to calculate the weight of these studies, the correlation between the first and second measurements is required (to account for the amount of within-participant variation). Since this measure is usually not reported, we provide below the median and range of correlations found in existing MAs (see also the sketch after this list):
Infant word segmentation from native speech: 0.641 (range: 0.140 to 0.921)
Infant vowel discrimination (native and nonnative): 0.496 (range: -0.413 to 0.855)
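Here is a minimal sketch in R covering both cases (all numbers are invented; the within-participant variance formula follows Borenstein et al., 2009):

```r
# (1) Between-participants: Cohen's d recovered from a reported t value
# and the two group sizes.
t_val <- 2.3; n1 <- 16; n2 <- 16
d_between <- t_val * sqrt(1 / n1 + 1 / n2)

# (2) Within-participants: the variance (and hence the weight) of d needs
# the correlation r between the two measurements. If r is unreported,
# impute a plausible value, e.g. the median from a related MA above.
d <- 0.50; n <- 20; r <- 0.496   # r imputed from the vowel discrimination MAs
var_d_within <- (1 / n + d^2 / (2 * n)) * 2 * (1 - r)

c(d_between = d_between, var_d_within = var_d_within)
```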
When entering papers, please remember a key thing: all analyses are done by machines, and machines cannot read text! So if a column is “numeric”, please do not enter anything that is not a number (such as text, spaces, ~, etc.). This is particularly important for the dependent measures!
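To see why this matters, consider what R does with a “numeric” column that contains stray text: unparseable entries silently become NA and drop out of all analyses.

```r
# What R makes of a numeric column containing stray characters:
as.numeric(c("12.5", "~13", "12 ", "n.a."))
# [1] 12.5   NA 12.0   NA   (plus a "NAs introduced by coercion" warning)
```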
At this stage, you might find that a given paper does not contain the right information to be included. In this case, you can and should exclude it. If you have already started entering it, you can leave the information you entered and note in “comments” that the entry is incomplete (although if you followed our advice above, you won’t have wasted time entering it!). Remember to update your spreadsheet with each paper you read and make a decision on.
The article I enter has 3 experiments, and the first is with adult participants. Do I need to enter this experiment?
No, please enter only the infant/child experiments.
The sound stimuli differ by approx. 6 ms in length, but the experiment is not about length differences. Do I have to report this difference even though it is very small?
If there is a column for stimulus length, please report it. You are right that this experiment is not about length differences, but having the information cannot hurt, and any later analyses will reflect that the difference is very small.
The article reports a table with the lengths of each individual stimulus. Should I calculate and report the average value?
Yes, please report the averaged value in the appropriate column.
I am entering an article using the HAS method. The authors report results both 2 and 4 minutes after the start of the test phase. Your example only reports the results after 2 minutes, but would you still want me to report both?
It is often the case that articles report more than one type of result. Please just report the ones that we also provide in the example file!
We use R to calculate effect sizes. Visit https://github.com/langcog/metalab2 for our code.
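As a minimal sketch of the kind of pooling involved (this uses the metafor package with invented numbers; see the repository above for the actual MetaLab code):

```r
# A minimal random-effects meta-analysis with the metafor package
# (install.packages("metafor") first). yi = effect sizes, vi = their
# variances; all values here are invented for illustration.
library(metafor)
yi <- c(0.42, 0.18, 0.55, 0.31)
vi <- c(0.08, 0.05, 0.12, 0.06)
model <- rma(yi = yi, vi = vi, method = "REML")
summary(model)  # pooled estimate, confidence interval, heterogeneity
```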
Two groups of infants are tested and I treat them as two different entries, but the numbers of included and excluded infants are only reported collapsed over both groups. What do I do?
As the best approximation we can get, please divide the reported number by the number of groups (in your case, 2).
The age of infants is reported in weeks, so I multiplied it by 7 to convert it into days. I read in your instructions that you have to multiply months by 30.42 to get a proxy for days. So my question is: do I have to multiply by a different number than 7 to get a proxy for days?
No, that’s fine the way you did it!
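In R, both conversions look like this (30.42 is simply the average number of days in a month):

```r
age_weeks  <- 8      # age reported in weeks (example value)
age_weeks * 7        # days: weeks convert exactly
age_months <- 2      # age reported in months (example value)
age_months * 30.42   # days: 30.42 approximates the average month length
```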
In some cases you will still need to contact the authors of the study. These people probably don’t know you, so think about what subject line would make you open an email from a stranger; something like “including your paper in an MA” should be motivating. People are busy: they don’t have time to read lengthy emails, especially from someone they don’t know, so be as concise as possible. You can always give them more details later if they ask. Don’t be shy; authors are likely to be happy to hear that someone is interested in their work and is going to cite them!
Interrogating PubMed via a script
Example MAs: InWordDB
Instructions for creating a CAMA (including further resources)
We welcome researchers interested in contributing to MetaLab. Please contact us at metalab-project@googlegroups.com.
Contributions can take various forms:
New meta-analyses looking for contributors
Currently, the following topics are in the process of becoming meta-analyses, and we are looking for contributors:
Resources
Here are a few resources on creating meta-analyses compatible with Metalab:
If you have already done a meta-analysis, you can easily add it to MetaLab. This tutorial explains how.
Contributing to MetaLab can have several advantages, both for you and for the community:

* Get more visibility for your MA. When publishing a paper, you want it to be read by as many people as possible. Placing it in a centralized repository such as MetaLab can help you reach this broad audience and gives more visibility to your meta-analysis.
* Increase the impact of your MA. As an author, you probably want your results to be fully understood by readers, and you want your readers to use your data as efficiently as possible. The interactive interface of the MetaLab website allows readers to navigate your meta-analysis results more easily than the paper version, and to play with the results when planning experiments. MetaLab is an opportunity for your meta-analysis to make a stronger impact.
* Contribute to drawing the broader developmental picture. You made a meta-analysis to help the community draw a clearer picture of an effect of interest and to contribute to theory assessment. MetaLab is a central platform that includes over 1,040 effect sizes. Incorporating your meta-analysis in this larger dataset helps the community get a better idea of cognitive development and language acquisition.
You remain the owner of your meta-analysis data: users must cite your data using your preferred citation. If your data are as yet unpublished, their inclusion in MetaLab does not count as publication. Learn more by reading our full citation policy.
You can retain control for as long as you want. Two options exist for the curation and review of your data. You can choose to be the curator: this means you agree to be the person responsible for identifying new relevant papers and signaling them to the MetaLab data manager, who will add them to the database of the relevant MA. You would be expected to check data entry once in a while. Curators are part of the MetaLab board and are kept informed of discussions regarding, e.g., site revamping. Alternatively, you can choose to step down completely, and it will be MetaLab’s job to assign a new curator to your dataset. In this case, we can still keep your photo on the wall of fame.
We prepared a spreadsheet that you must use to code your data. The key property of this spreadsheet is that it has one row per effect size, so you should fill in as many rows as there are effect sizes in your meta-analysis.

* The first tab, called “Data”, should contain your data.
* The second tab, called “CodeBook”, explains the codes to be used when you fill in the “Data” tab.
* The third tab, called “Methods”, lists all the possible options for the “method” column and their respective descriptions.
If you work from the appendix of your paper:

1. Create a tab for each appendix and copy-paste your appendix there. Check that the content of your appendix hasn’t been messed up when pasting.
2. Edit the data to have one row per effect size.
3. If you have several appendices, create another tab and copy-paste the content of the appendix-specific tabs side by side to check that all lines match up.
4. Once you have synthesized all your appendices in one table, copy-paste each column corresponding to a MetaLab requirement into the “Data” tab and edit each cell to match the data format specified in the “CodeBook” tab.
Write the two references in each of the first three columns (study_ID, long_cite, short_cite), separated by commas. The short_cite column should use the in-text citation format, e.g., Smith (2002, 2008) if the two papers are by the same authors. Fill in the other columns as usual.
Fill the missing columns with “NA”. If you don’t have one of the mandatory columns, please let us know.
Most journal articles are peer-reviewed; some conference proceedings (e.g., Cognitive Science) are peer-reviewed. Typically, book chapters, posters, and conference abstracts are not considered peer-reviewed, because no reviewer has seen the full details of the methods.